Abstract:Partially Observable Markov Decision Processes (POMDPs) model decision making under uncertainty. While there are many approaches to approximately solving POMDPs, we aim to address the problem of learning such models. In particular, we are interested in a subclass of POMDPs wherein the components of the model, including the observation function, reward function, transition function, and initial state distribution function, can be modeled as low-complexity probabilistic graphical models in the form of a short probabilistic program. Our strategy to learn these programs uses an LLM as a prior, generating candidate probabilistic programs that are then tested against the empirical distribution and adjusted through feedback. We experiment on a number of classical toy POMDP problems, simulated MiniGrid domains, and two real mobile-base robotics search domains involving partial observability. Our results show that using an LLM to guide in the construction of a low-complexity POMDP model can be more effective than tabular POMDP learning, behavior cloning, or direct LLM planning.
Abstract:MLLMs have recently become a focal point in the field of artificial intelligence research. Building on the strong capabilities of LLMs, MLLMs are adept at addressing complex multi-modal tasks. With the release of GPT-4, MLLMs have gained substantial attention from different domains. Researchers have begun to explore the potential of MLLMs in the medical and healthcare domain. In this paper, we first introduce the background and fundamental concepts related to LLMs and MLLMs, while emphasizing the working principles of MLLMs. Subsequently, we summarize three main directions of application within healthcare: medical reporting, medical diagnosis, and medical treatment. Our findings are based on a comprehensive review of 330 recent papers in this area. We illustrate the remarkable capabilities of MLLMs in these domains by providing specific examples. For data, we present six mainstream modes of data along with their corresponding evaluation benchmarks. At the end of the survey, we discuss the challenges faced by MLLMs in the medical and healthcare domain and propose feasible methods to mitigate or overcome these issues.
Abstract:With the increasing use of surgical robots in clinical practice, enhancing their ability to process multimodal medical images has become a key research challenge. Although traditional medical image fusion methods have made progress in improving fusion accuracy, they still face significant challenges in real-time performance, fine-grained feature extraction, and edge preservation.In this paper, we introduce TTTFusion, a Test-Time Training (TTT)-based image fusion strategy that dynamically adjusts model parameters during inference to efficiently fuse multimodal medical images. By adapting the model during the test phase, our method optimizes the parameters based on the input image data, leading to improved accuracy and better detail preservation in the fusion results.Experimental results demonstrate that TTTFusion significantly enhances the fusion quality of multimodal images compared to traditional fusion methods, particularly in fine-grained feature extraction and edge preservation. This approach not only improves image fusion accuracy but also offers a novel technical solution for real-time image processing in surgical robots.
Abstract:Visible-Infrared Person Re-identification (VIReID) aims to match visible and infrared pedestrian images, but the modality differences and the complexity of identity features make it challenging. Existing methods rely solely on identity label supervision, which makes it difficult to fully extract high-level semantic information. Recently, vision-language pre-trained models have been introduced to VIReID, enhancing semantic information modeling by generating textual descriptions. However, such methods do not explicitly model body shape features, which are crucial for cross-modal matching. To address this, we propose an effective Body Shape-aware Textual Alignment (BSaTa) framework that explicitly models and utilizes body shape information to improve VIReID performance. Specifically, we design a Body Shape Textual Alignment (BSTA) module that extracts body shape information using a human parsing model and converts it into structured text representations via CLIP. We also design a Text-Visual Consistency Regularizer (TVCR) to ensure alignment between body shape textual representations and visual body shape features. Furthermore, we introduce a Shape-aware Representation Learning (SRL) mechanism that combines Multi-text Supervision and Distribution Consistency Constraints to guide the visual encoder to learn modality-invariant and discriminative identity features, thus enhancing modality invariance. Experimental results demonstrate that our method achieves superior performance on the SYSU-MM01 and RegDB datasets, validating its effectiveness.
Abstract:We propose Cabbage, a differential growth framework to model buckling behavior in 3D open surfaces found in nature-like the curling of flower petals. Cabbage creates high-quality triangular meshes free of self-intersection. Cabbage-Shell is driven by edge subdivision which differentially increases discretization resolution. Shell forces expands the surface, generating buckling over time. Feature-aware smoothing and remeshing ensures mesh quality. Corrective collision effectively prevents self-collision even in tight spaces. We additionally provide Cabbage-Collision, and approximate alternative, followed by CAD-ready surface generation. Cabbage is the first open-source effort with this calibre and robustness, outperforming SOTA methods in its morphological expressiveness, mesh quality, and stably generates large, complex patterns over hundreds of simulation steps. It is a source not only of computational modeling, digital fabrication, education, but also high-quality, annotated data for geometry processing and shape analysis.
Abstract:Ophthalmic diseases pose a significant global health challenge, yet traditional diagnosis methods and existing single-eye deep learning approaches often fail to account for binocular pathological correlations. To address this, we propose DMS-Net, a dual-modal multi-scale Siamese network for binocular fundus image classification. Our framework leverages weight-shared Siamese ResNet-152 backbones to extract deep semantic features from paired fundus images. To tackle challenges such as lesion boundary ambiguity and scattered pathological distributions, we introduce a Multi-Scale Context-Aware Module (MSCAM) that integrates adaptive pooling and attention mechanisms for multi-resolution feature aggregation. Additionally, a Dual-Modal Feature Fusion (DMFF) module enhances cross-modal interaction through spatial-semantic recalibration and bidirectional attention, effectively combining global context and local edge features. Evaluated on the ODIR-5K dataset, DMS-Net achieves state-of-the-art performance with 80.5% accuracy, 86.1% recall, and 83.8% Cohen's kappa, demonstrating superior capability in detecting symmetric pathologies and advancing clinical decision-making for ocular diseases.
Abstract:Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than Diffusion-based models. A primary limitation is the substantial number of image tokens required for AR models, which constrains both training and inference efficiency, as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from visual encoder are directly mapped to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along channel dimension to decrease the input token number, and token-unshuffle, which untangles the inferred tokens after Transformer blocks to restore the spatial arrangement for output. Jointly training with textual prompts, our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token prediction way while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our prominent image generation ability in terms of text-alignment, visual flaw, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.
Abstract:Goal-oriented navigation presents a fundamental challenge for autonomous systems, requiring agents to navigate complex environments to reach designated targets. This survey offers a comprehensive analysis of multimodal navigation approaches through the unifying perspective of inference domains, exploring how agents perceive, reason about, and navigate environments using visual, linguistic, and acoustic information. Our key contributions include organizing navigation methods based on their primary environmental reasoning mechanisms across inference domains; systematically analyzing how shared computational foundations support seemingly disparate approaches across different navigation tasks; identifying recurring patterns and distinctive strengths across various navigation paradigms; and examining the integration challenges and opportunities of multimodal perception to enhance navigation capabilities. In addition, we review approximately 200 relevant articles to provide an in-depth understanding of the current landscape.
Abstract:Video Anomaly Detection~(VAD) focuses on identifying anomalies within videos. Supervised methods require an amount of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and uses signal ratio thresholding to detect event boundaries via unsupervised statistical features. The statistical boundary detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs in performing reasoning before determining final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.
Abstract:3D captioning, which aims to describe the content of 3D scenes in natural language, remains highly challenging due to the inherent sparsity of point clouds and weak cross-modal alignment in existing methods. To address these challenges, we propose 3D CoCa, a novel unified framework that seamlessly combines contrastive vision-language learning with 3D caption generation in a single architecture. Our approach leverages a frozen CLIP vision-language backbone to provide rich semantic priors, a spatially-aware 3D scene encoder to capture geometric context, and a multi-modal decoder to generate descriptive captions. Unlike prior two-stage methods that rely on explicit object proposals, 3D CoCa jointly optimizes contrastive and captioning objectives in a shared feature space, eliminating the need for external detectors or handcrafted proposals. This joint training paradigm yields stronger spatial reasoning and richer semantic grounding by aligning 3D and textual representations. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that 3D CoCa significantly outperforms current state-of-the-arts by 10.2% and 5.76% in CIDEr at 0.5IoU, respectively. Code will be available at https://github.com/AIGeeksGroup/3DCoCa.